
Conversation

@SaintBacchus
Contributor

Dynamic allocation sets the total executor count to a small number when it wants to kill some executors.
But Spark also sets the total executor count in the non-dynamic-allocation scenario.
This causes the following problem: when an executor goes down, no replacement executor will ever be brought up by Spark.

@SparkQA

SparkQA commented Jun 5, 2015

Test build #34244 has finished for PR 6662 at commit 016214d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 5, 2015

Test build #34242 has finished for PR 6662 at commit 610c390.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Can you explain how this method is called with dynamic allocation disabled?

The only call chain I can find starts with ExecutorAllocationManager, which is not instantiated when dynamic allocation is off.

Contributor

you can do sc.requestTotalExecutors

Contributor

You mean:

private[spark] override def requestTotalExecutors

I don't see any calls to it, and given it's private[spark]...

Contributor

sorry, I meant sc.requestExecutors, which eventually calls the method here.
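
For reference, a rough sketch of the call path being discussed here; the comments paraphrase the chain from the public developer API down to the backend and are an approximation, not the exact Spark source:

```scala
// Even with dynamic allocation disabled, user code can call the public
// developer API on SparkContext:
sc.requestExecutors(2)
// which, for a CoarseGrainedSchedulerBackend, roughly follows
//   SparkContext.requestExecutors
//     -> CoarseGrainedSchedulerBackend.requestExecutors
//     -> doRequestTotalExecutors(newTotal)   // the method reviewed here
```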

@andrewor14
Contributor

@SaintBacchus can you elaborate on the description a little? I'm not sure if I follow what the symptoms are and how you reproduced them.

@SaintBacchus
Contributor Author

@andrewor14 @vanzin I drew a simple call stack, like this:
[call stack diagram]

When the doRequestTotalExecutors logic runs, it resets the application's total executor count.
The problem is that if, at that moment, the number of alive executors differs from the original request, Spark will never bring the missing executors back up.
This simple scenario reproduces the issue:

  • There are 2 applications and each wants 2 executors, so 4 CPU cores are wanted in total (every executor needs one core).
  • The RM only has 3 cores, so the first application (A) gets 2 cores while the second application (B) gets only one core and waits for A to release a core.
  • Then one of A's executors is killed; B brings up its second executor and A now has to wait for resources.
  • After the timeout logic fires in A, application B finishes its job and releases its resources.
  • The expectation is that A will bring up its other executor again, but in practice that never happens.

A may be a Streaming application.
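
As a rough illustration of the bookkeeping problem above, here is a toy Scala sketch; the variable names are made up and this is not the actual Spark code:

```scala
// Toy model of the executor-count state for application A.
var targetExecutors = 2          // what A originally asked for
var aliveExecutors  = 2

// One of A's executors is killed; the kill path ends in
// doRequestTotalExecutors, which resets the total to what is alive now.
aliveExecutors -= 1
targetExecutors = aliveExecutors // the original request of 2 is forgotten

// Later, when B releases its resources, A only tries to satisfy
// targetExecutors (now 1), so the lost executor is never brought back.
```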

@SaintBacchus
Contributor Author

@andrewor14 did I describe the scenario clearly? Can you review it again?

@andrewor14
Contributor

I see. The issue is that the AM forgets about the original number of executors it wants after calling sc.killExecutor. Even if dynamic allocation is not enabled, this is still possible because of heartbeat timeouts.

I think the problem is that sc.killExecutor is used incorrectly in HeartbeatReceiver. The intention of the method is to permanently adjust the number of executors the application will get. In HeartbeatReceiver, however, this is used as a best-effort mechanism to ensure that the timed out executor is dead.
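
To make the mismatch in intent concrete, here is a small illustrative sketch (paraphrasing the behavior described above, not the actual HeartbeatReceiver code):

```scala
// Intended semantics of the API: permanently downsize the application.
sc.killExecutor("3")   // caller wants one executor fewer from now on

// How HeartbeatReceiver uses it: a best-effort "make sure the timed-out
// executor is really dead". It goes through the same code path, so it
// also lowers the target total, and the application never asks for a
// replacement executor afterwards.
```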

@andrewor14
Contributor

I have updated the description on the JIRA. However, this patch is definitely not the correct fix. The user should be able to call sc.requestExecutors (a public developer API) even when dynamic allocation is not enabled. This patch disallows that.
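
For example, something like the following should remain valid with dynamic allocation turned off (a minimal sketch; the master, app name, and configuration are only illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Dynamic allocation explicitly disabled.
val conf = new SparkConf()
  .setMaster("yarn-client")                              // illustrative
  .setAppName("manual-executor-requests")                // illustrative
  .set("spark.dynamicAllocation.enabled", "false")
val sc = new SparkContext(conf)

// Public developer API: ask for two more executors. This call should
// not be rejected just because dynamic allocation is off.
sc.requestExecutors(2)
```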

I'll submit a fix separately. In the meantime, could you close this PR? Thanks for your work @SaintBacchus.

@SaintBacchus
Contributor Author

OK
